Reading list for Tue, Apr. 9, 2024

Total: 777, of which 3 suggested today and 520 expired

Today's reading list

3 links selected from 257 today Tue, Apr. 9, 2024
  1. Concurrency Freaks: 50 years later, is Two-Phase Locking the best we can do?

    Two phase locking (2PL) was the first of the general-purpose Concurrency Controls to be invented which provided Serializability. In fact, 2PL gives more than Serializability, it gives Opacity, a much stronger isolation level.
    2PL was published in 1976, which incidentally is the year I was born, and it is likely that Jim Gray and his buddies had this idea long before it was published, which means 2PL first came into existence nearly 50 years ago.

    After all that time has passed, is this the best we can do?

    Turns out no, we can do better with 2PLSF, but let's start from the beginning.

    When I use the term "general-purpose concurrency control" I mean an algorithm which allows access to multiple objects (or records, or tuples, whatever you want to name them) with all-or-nothing semantics. In other words, an algorithm that lets you do transactions over multiple data items.
    Two-Phase Locking has several advantages over the other concurrency controls that have since been invented, but in my view there are two important ones: simplicity and a strong isolation level.

    In 2PL, before accessing a record for read or write access, we must first take the lock that protects this record.
    During the transaction, we keep acquiring locks for each access, and only at the end of the transaction, when we know that no further accesses will be made, can we release all the locks. Having an instant in time (i.e. the end of the transaction) where all the locks are taken on the data that we accessed, means that there is a linearization point for our operation (transaction), which means we have a consistent view of the different records and can write to other records all in a consistent way. It doesn't get much simpler than this.
    Today, this idea may sound embarrassingly obvious, but 50 years ago many database researchers thought that it was ok to release the locks after completing the access to the record. And yes, it is possible to do so, but such a concurrency control is not serializable.

    As for strong isolation, database researchers continue to invent concurrency controls that are not serializable, and to write papers about them, which would suggest Serializability is not that important for databases. On the other hand, all transactional commercial databases that I know of use 2PL or some combination of it with T/O or MVCC.

    Moreover, in the field of concurrent data structures, linearizability is the gold standard, which means 2PL is used heavily. If we need to write to multiple nodes of a data structure in a consistent way, we typically need something like 2PL, at least for the write accesses. The exception to this is lock-free data structures, but hey, that's why (correct) lock-free is hard!

    Ok, 2PL is easy to use and has strong isolation, so this means we're ready to go and don't need anything better than 2PL, right?
    I'm afraid not. 2PL has a couple of big disadvantages: poor read-scalability and live-lock progress.

    The classic 2PL was designed for mutual exclusion locks, which means that when two threads are performing a read-access on the same record, they will conflict and one of them (or both) will abort and restart.
    This problem can be solved by replacing the mutual exclusion locks with reader-writer locks, but it's not as simple as this.
    Mutual exclusion locks can be implemented with a single bit, representing the state of locked or unlocked.
    Reader-writer locks also need this bit and, in addition, need a counter of the number of readers currently holding the lock in read mode. This counter needs enough bits to represent the number of readers. For example, 7 bits can count up to 127 readers, enough for a system with up to 127 threads in case they all decide to acquire the read-lock on a particular reader-writer lock instance.
    For such a scenario this implies that each lock would take 1 byte, which may not sound like much, but if you have billions of records in your DB then you will need billions of bytes for those locks. Still reasonable, but now we get into the problem of contention on the counter.

    Certain workloads have lots of read accesses on the same data, they are read-non-disjoint. An example of this is the root node of a binary search tree, where all operations need to read the root before they start descending the nodes of the tree.
    When using 2PL, each of these accesses on the root node implies a lock acquisition, and even if we're using reader-writer locks, it implies heavy contention on the lock that protects the root node.

    Previous approaches have taken a stab at this problem, for example TLRW by Dave Dice and Nir Shavit in SPAA 2010.
    By using reader-writer locks they were able to have much better performance than using mutual exclusion locks, but still far from what the optimistic concurrency controls can achieve.
    Take the example of the plot below where we have an implementation similar to TLRW with each read-access contending on a single variable of the reader-writer lock, applied to a binary search tree, a Rank-based Relaxed AVL. Scalability is flat regardless of whether we're doing mostly write-transactions (left side plot) or just read-transactions (right side plot).

    Turns out it is possible to overcome this "read-indicator contention" problem through the use of scalable read-indicators. Our favorite algorithm is a reader-writer lock where each reader announces its arrival/departure on a separate cache line, thus having no contention for read-lock acquisition. The downside is that the thread taking the write-lock must scan through all those cache lines to ascertain whether the write-lock can be granted, thus incurring a higher cost for write-lock acquisition.
    As far as I know, the first reader-writer lock algorithms with this technique were shown in the paper "NUMA-Aware Reader-Writer Locks", of which Dave Dice and Nir Shavit are two of the authors, along with Irina Calciu, Yossi Lev, Victor Luchangco, and Virendra Marathe.
    This paper shows three different reader-writer lock algorithms, two with high scalability, but neither of those is starvation-free.

    So what we did was take some of these ideas to make a better reader-writer lock, which also scales well for read-lock acquisition but has other properties, and we used this to implement our own concurrency control which we called Two-Phase Locking Starvation-Free (2PLSF).
    The reader-writer locks in 2PLSF have one bit per thread reserved for the read-lock, but these bits are located in their own cache line, along with the read-indicator bits of the adjacent locks.


    As in the "NUMA-Aware Reader-Writer Locks" paper, the cost shifts to write-lock acquisition, which needs to scan multiple cache lines to acquire the write-lock. There is no magic here, just trade-offs, but this turns out to be a pretty good trade-off, as most workloads tend to be on the read-heavy side. Even write-intensive workloads spend a good amount of time executing read-accesses, for example during the record lookup phase.
    With our improved reader-writer lock the same benchmark shown previously for the binary search tree looks very different:

    With this improved reader-writer lock we are able to scale 2PL even on read-non-disjoint workloads, but it still leaves out the other major disadvantage, 2PL is prone to live-lock.

    There are several variants of the original 2PL, some of these variants aren't even serializable, therefore I wouldn't call them 2PL anymore and won't bother going into that.
    For the classical 2PL, there are three variants and they are mostly about how to deal with contention. They're usually named:
        - No-Wait
        - Wait-Or-Die
        - Deadlock-detection

    When a conflict is encountered, the No-Wait variant aborts the transaction itself (or the other transaction) and retries. This retry can be immediate, or it can happen later, based on an exponential backoff scheme. The No-Wait approach has live-lock progress: two transactions, one attempting to modify record A and then record B while the other attempts to modify record B and then record A, may conflict indefinitely, aborting and restarting without either of them ever being able to commit.

    The Deadlock-Detection variant keeps an internal list of threads waiting on a lock and detects cycles (deadlocks).
    This is problematic for reader-writer locks because it would require each reader to have its own list, which itself needs a (mutual exclusion) lock to protect it. And detecting the cycles would mean scanning all the readers' lists when the lock is taken in read-lock mode.
    Theoretically it should be possible to make this scheme starvation-free, but it would require using starvation-free locks, and as there is no (published) highly scalable reader-writer lock with starvation-free progress, it kind of defeats the purpose. Moreover, having one list per reader may imply high memory usage. Who knows, maybe one day someone will try this approach.

    The Wait-Or-Die variant imposes an order on all transactions, typically with a timestamp of when the transaction started and, when a lock conflict arises, decides to wait for the lock or to abort, by comparing the timestamp of the transaction with the timestamp of the lock owner. This works fine for mutual exclusion locks as the owner can be stored in the lock itself using a unique-thread identifier, but if we want to do it for reader-writer locks then a thread-id would be needed per reader.
    If we want to support 256 threads then this means we need 8 bits x 256 = 256 bytes per reader-writer lock. Using 256 bytes per lock is a hard pill to swallow!

    But memory usage is not the real obstacle here. The Wait-Or-Die approach implies that all transactions have a unique transaction id so as to order them, for example, they can take a number from an atomic variable using a fetch_and_add() instruction.
    The problem with this is that on most modern CPUs you won't be able to do more than 40 million fetch_and_add() operations per second on a contended atomic variable. This may seem like a lot (Visa does about 660 million transactions per day, so doing 40 million per second sounds pretty good), but when it comes to in-memory DBMS it's not that much, and for concurrent data structures in particular it's a bit on the low side.
    Even worse, this atomic fetch_and_add() must be done for all transactions, whether they are write-transactions or read-transactions.
    For example, in one of our machines it's not really possible to go above 20 M fetch_and_add() per second, which means that scalability suckz:

    To put this in perspective, one of my favorite concurrency controls is TL2 which was invented by (surprise!) none other than Dave Dice, Nir Shavit and Ori Shalev
    I hope by now you know who the experts in this stuff are  ;)

    Anyways, in TL2 the read-transactions don't need to do an atomic fetch_and_add(); they execute optimistic reads, which is faster than any read-lock acquisition you can think of. At least for read-transactions, TL2 can scale to hundreds of millions of transactions per second. By comparison, 2PL with Wait-Or-Die will never be able to go above 40 M tps.
    This means if high scalability is your goal, then you would be better off with TL2 than 2PL… except, 2PLSF solves this problem too.

    In 2PLSF only the transactions that go into conflict need to be ordered, i.e. only these need to do a fetch_and_add() on a central atomic variable. This has two benefits: it means there is less contention on the central atomic variable that assigns the unique transaction id, and it means that transactions without conflicts are not bounded by the 40 M tps plateau.
    This means that we can have 200 M tps running without conflict alongside 40 M tps that are in conflict, because the conflicting transactions are the only ones that need to do the fetch_and_add() and are therefore the only ones bounded by the maximum number of fetch_and_add()s the CPU can execute per second.
    On top of this, the 2PLSF algorithm provides starvation-freedom.

    Summary

    In this post we saw some of the advantages and disadvantages of 2PL and some of the variants of 2PL.
    We explained what it takes to scale 2PL: make a better reader-writer lock.
    But the big disadvantage of 2PL is the live-lock progress, which some variants could seemingly resolve, but in practice they don't because they will not scale, even with a better reader-writer lock.
    Then we described 2PLSF, a novel algorithm invented by me, Andreia and Pascal Felber to address these issues.

    In summary, 2PLSF is what 2PL should have been from the start, a concurrency control that scales well even when reads are non-disjoint and that provides starvation-free transactions, the highest form of blocking progress there is.
    Moreover, it's pretty good at solving certain kinds of conflicts, which means it can still scale even when some conflicts arise. 2PLSF is not perfect, but it's good enough, certainly better than TL2 when it comes to solving conflicts.

    Despite being two-phase locking, it's as close to 2PL as a jackhammer is to a pickaxe. 

    2PLSF is not your grandfather's 2PL



  2. The Secret To Life Is Not Taking It So Seriously

    There is something about age and experience that shines a powerful light on the absurdity of life. It’s as if, after enough trips around the sun, we begin to see patterns of history repeating itself…


    PHILOSOPHY

    The search for meaning in life is a fool’s errand and a recipe for unhappiness and disillusionment

    David Todd McCarty, published in Ellemeno, Oct 27, 2022 (7 min read)

    “Life has become immeasurably better since I have been forced to stop taking it seriously.”
    ― Hunter S. Thompson


  3. Confusing git terminology

    Hello! I’m slowly working on explaining git. One of my biggest problems is that after almost 15 years of using git, I’ve become very used to git’s idiosyncrasies and it’s easy for me to forget what’s confusing about it.

    So I asked people on Mastodon:

    what git jargon do you find confusing? thinking of writing a blog post that explains some of git’s weirder terminology: “detached HEAD state”, “fast-forward”, “index/staging area/staged”, “ahead of ‘origin/main’ by 1 commit”, etc

    I got a lot of GREAT answers and I’ll try to summarize some of them here.

    I’ve done my best to explain what’s going on with these terms, but they cover basically every single major feature of git which is definitely too much for a single blog post so it’s pretty patchy in some places.

    HEAD and “heads”

    A few people said they were confused by the terms HEAD and refs/heads/main, because it sounds like it’s some complicated technical internal thing.

    Here’s a quick summary:

    • “heads” are “branches”. Internally in git, branches are stored in a directory called .git/refs/heads. (technically the official git glossary says that the branch is all the commits on it and the head is just the most recent commit, but they’re 2 different ways to think about the same thing)
    • HEAD is the current branch. It’s stored in .git/HEAD.
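    Both bullets are easy to check in a throwaway repo (the `-b main` option pins the branch name and needs git ≥ 2.28; the exported identity just lets the commit succeed anywhere):

```shell
git init -q -b main head-demo && cd head-demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m "first"

cat .git/HEAD             # → ref: refs/heads/main   (HEAD names the current branch)
cat .git/refs/heads/main  # the commit ID that the branch "main" points to
```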

    I think that “a head is a branch, HEAD is the current branch” is a good candidate for the weirdest terminology choice in git, but it’s definitely too late for a clearer naming scheme so let’s move on.

    There are some important exceptions to “HEAD is the current branch”, which we’ll talk about next.

    “detached HEAD state”

    You’ve probably seen this message:

    $ git checkout v0.1
    You are in 'detached HEAD' state. You can look around, make experimental
    changes and commit them, and you can discard any commits you make in this
    state without impacting any branches by switching back to a branch.
    
    [...]
    

    Here’s the deal with this message:

    • In Git, usually you have a “current branch” checked out, for example main.
    • The place the current branch is stored is called HEAD.
    • Any new commits you make will get added to your current branch, and if you run git merge other_branch, that will also affect your current branch
    • But HEAD doesn’t have to be a branch! Instead it can be a commit ID.
    • Git calls this state (where HEAD is a commit ID instead of a branch) “detached HEAD state”
    • For example, you can get into detached HEAD state by checking out a tag, because a tag isn’t a branch
    • if you don’t have a current branch, a bunch of things break:
      • git pull doesn’t work at all (since the whole point of it is to update your current branch)
      • neither does git push unless you use it in a special way
      • git commit, git merge, git rebase, and git cherry-pick do still work, but they’ll leave you with “orphaned” commits that aren’t connected to any branch, so those commits will be hard to find
    • You can get out of detached HEAD state by either creating a new branch or switching to an existing branch
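    You can watch HEAD detach and reattach directly, since it’s just a file (scratch-repo demo; `-b main` needs git ≥ 2.28):

```shell
git init -q -b main detach-demo && cd detach-demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m one
git commit -q --allow-empty -m two

git checkout -q HEAD~1     # detached: .git/HEAD now holds a raw commit ID
cat .git/HEAD              # a bare hash, not "ref: refs/heads/..."
git checkout -q -b rescue  # creating a branch reattaches HEAD
cat .git/HEAD              # → ref: refs/heads/rescue
```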

    “ours” and “theirs” while merging or rebasing

    If you have a merge conflict, you can run git checkout --ours file.txt to pick the version of file.txt from the “ours” side. But which side is “ours” and which side is “theirs”?

    I always find this confusing and I never use git checkout --ours because of that, but I looked it up to see which is which.

    For merges, here’s how it works: the current branch is “ours” and the branch you’re merging in is “theirs”, like this. Seems reasonable.

    $ git checkout merge-into-ours # current branch is "ours"
    $ git merge from-theirs # branch we're merging in is "theirs"
    

    For rebases it’s the opposite – the current branch is “theirs” and the target branch we’re rebasing onto is “ours”, like this:

    $ git checkout theirs # current branch is "theirs"
    $ git rebase ours # branch we're rebasing onto is "ours"
    

    I think the reason for this is that under the hood git rebase main is repeatedly merging commits from the current branch into a copy of the main branch (you can see what I mean by that in this weird shell script that implements git rebase using git merge). But I still find it confusing.

    This nice tiny site explains the “ours” and “theirs” terms.

    A couple of people also mentioned that VSCode calls “ours”/“theirs” “current change”/“incoming change”, and that it’s confusing in the exact same way.

    “Your branch is up to date with ‘origin/main’”

    This message seems straightforward – it’s saying that your main branch is up to date with the origin!

    But it’s actually a little misleading. You might think that this means that your main branch is up to date. It doesn’t. What it actually means is – if you last ran git fetch or git pull 5 days ago, then your main branch is up to date with all the changes as of 5 days ago.

    So if you don’t realize that, it can give you a false sense of security.

    I think git could theoretically give you a more useful message like “is up to date with the origin’s main as of your last fetch 5 days ago” because the time that the most recent fetch happened is stored in the reflog, but it doesn’t.

    HEAD^, HEAD~, HEAD^^, HEAD~~, HEAD^2, HEAD~2

    I’ve known for a long time that HEAD^ refers to the previous commit, but I’ve been confused for a long time about the difference between HEAD~ and HEAD^.

    I looked it up, and here’s how these relate to each other:

    • HEAD^ and HEAD~ are the same thing (1 commit ago)
    • HEAD^^^ and HEAD~~~ and HEAD~3 are the same thing (3 commits ago)
    • HEAD^3 refers to the third parent of a commit, and is different from HEAD~3

    This seems weird – why are HEAD~ and HEAD^ the same thing? And what’s the “third parent”? Is that the same thing as the parent’s parent’s parent? (spoiler: it isn’t) Let’s talk about it!

    Most commits have only one parent. But merge commits have multiple parents – they’re merging together 2 or more commits. In Git HEAD^ means “the parent of the HEAD commit”. But what if HEAD is a merge commit? What does HEAD^ refer to?

    The answer is that HEAD^ refers to the first parent of the merge, HEAD^2 is the second parent, HEAD^3 is the third parent, etc.

    But I guess they also wanted a way to refer to “3 commits ago”, so HEAD^3 is the third parent of the current commit (which may have many parents if it’s a merge commit), and HEAD~3 is the parent’s parent’s parent.

    I think in the context of the merge commit ours/theirs discussion earlier, HEAD^ is “ours” and HEAD^2 is “theirs”.
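    These equivalences are easy to verify with git rev-parse in a scratch repo containing one merge commit (`-b main` needs git ≥ 2.28):

```shell
git init -q -b main parents-demo && cd parents-demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m A
git commit -q --allow-empty -m B
git checkout -q -b side HEAD~1   # branch off at A
git commit -q --allow-empty -m C
git checkout -q main
git merge -q -m M side           # merge commit M, parents B ("ours") and C ("theirs")

git rev-parse HEAD^ HEAD~        # same commit printed twice (B, the first parent)
git log --format=%s -1 HEAD^2    # → C  (second parent, not the grandparent)
git log --format=%s -1 HEAD~2    # → A  (the grandparent, via first parents)
```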

    .. and ...

    Here are two commands:

    • git log main..test
    • git log main...test

    What’s the difference between .. and ...? I never use these so I had to look it up in man git-range-diff. It seems like the answer is that in this case:

    A - B main
      \ 
        C - D test
    
    • main..test is commits C and D
    • test..main is commit B
    • main...test is commits B, C, and D
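    You can reproduce exactly that picture in a scratch repo and count the commits each range selects (`-b main` needs git ≥ 2.28):

```shell
git init -q -b main range-demo && cd range-demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git commit -q --allow-empty -m A
git commit -q --allow-empty -m B
git checkout -q -b test HEAD~1   # test branches off after A
git commit -q --allow-empty -m C
git commit -q --allow-empty -m D

git rev-list --count main..test   # → 2  (C and D)
git rev-list --count test..main   # → 1  (just B)
git rev-list --count main...test  # → 3  (B, C and D)
```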

    But it gets worse: apparently git diff also supports .. and ..., but they do something completely different than they do with git log? I think the summary is:

    • git log test..main shows changes on main that aren’t on test, whereas git log test...main shows changes on both sides.
    • git diff main..test shows both test changes and main changes (it diffs B and D), whereas git diff main...test diffs A and D (it only shows you the changes on one side).

    this blog post talks about it a bit more.

    “can be fast-forwarded”

    Here’s a very common message you’ll see in git status:

    $ git status
    On branch main
    Your branch is behind 'origin/main' by 2 commits, and can be fast-forwarded.
      (use "git pull" to update your local branch)
    

    What does “fast-forwarded” mean? Basically it’s trying to say that the two branches look something like this: (newest commits are on the right)

    main:        A - B - C
    origin/main: A - B - C - D - E
    

    or visualized another way:

    A - B - C - D - E (origin/main)
            |
           main
    

    Here origin/main just has 2 extra commits that main doesn’t have, so it’s easy to bring main up to date – we just need to add those 2 commits. Literally nothing can possibly go wrong – there’s no possibility of merge conflicts. A fast forward merge is a very good thing! It’s the easiest way to combine 2 branches.

    After running git pull, you’ll end up in this state:

    main:        A - B - C - D - E
    origin/main: A - B - C - D - E
    

    Here’s an example of a state which can’t be fast-forwarded.

    A - B - C - X  (main)
             \
              D - E  (origin/main)
    

    Here main has a commit that origin/main doesn’t have (X). So you can’t do a fast forward. In that case, git status would say:

    $ git status
    Your branch and 'origin/main' have diverged,
    and have 1 and 2 different commits each, respectively.
    

    “reference”, “symbolic reference”

    I’ve always found the term “reference” kind of confusing. There are at least 3 things that get called “references” in git:

    • branches and tags like main and v0.2
    • HEAD, which is the current branch
    • things like HEAD^^^ which git will resolve to a commit ID. Technically these are probably not “references”, I guess git calls them “revision parameters” but I’ve never used that term.

    “symbolic reference” is a very weird term to me because personally I think the only symbolic reference I’ve ever used is HEAD (the current branch), and HEAD has a very central place in git (most of git’s core commands’ behaviour depends on the value of HEAD), so I’m not sure what the point of having it as a generic concept is.

    refspecs

    When you configure a git remote in .git/config, there’s this +refs/heads/main:refs/remotes/origin/main thing.

    [remote "origin"]
    	url = git@github.com:jvns/pandas-cookbook
    	fetch = +refs/heads/main:refs/remotes/origin/main
    

    I don’t really know what this means, I’ve always just used whatever the default is when you do a git clone or git remote add, and I’ve never felt any motivation to learn about it or change it from the default.

    “tree-ish”

    The man page for git checkout says:

     git checkout [-f|--ours|--theirs|-m|--conflict=<style>] [<tree-ish>] [--] <pathspec>...
    

    What’s tree-ish??? What git is trying to say here is when you run git checkout THING ., THING can be either:

    • a commit ID (like 182cd3f)
    • a reference to a commit ID (like main or HEAD^^ or v0.3.2)
    • a subdirectory inside a commit (like main:./docs)
    • I think that’s it????

    Personally I’ve never used the “directory inside a commit” thing and from my perspective “tree-ish” might as well just mean “commit or reference to commit”.

    “index”, “staged”, “cached”

    All of these refer to the exact same thing (the file .git/index, which is where your changes are staged when you run git add):

    • git diff --cached
    • git rm --cached
    • git diff --staged
    • the file .git/index

    Even though they all ultimately refer to the same file, there’s some variation in how those terms are used in practice:

    • Apparently the flags --index and --cached do not generally mean the same thing. I have personally never used the --index flag so I’m not going to get into it, but this blog post by Junio Hamano (git’s lead maintainer) explains all the gnarly details
    • the “index” lists untracked files (I guess for performance reasons) but you don’t usually think of the “staging area” as including untracked files

    “reset”, “revert”, “restore”

    A bunch of people mentioned that “reset”, “revert” and “restore” are very similar words and it’s hard to differentiate them.

    I think it’s made worse because

    • git reset --hard and git restore . on their own do basically the same thing. (though git reset --hard COMMIT and git restore --source COMMIT . are completely different from each other)
    • the respective man pages don’t give very helpful descriptions:
      • git reset: “Reset current HEAD to the specified state”
      • git revert: “Revert some existing commits”
      • git restore: “Restore working tree files”

    Those short descriptions do give you a better sense for which noun is being affected (“current HEAD”, “some commits”, “working tree files”) but they assume you know what “reset”, “revert” and “restore” mean in this context.

    Here are some short descriptions of what they each do:

    • git revert COMMIT: Create a new commit that’s the “opposite” of COMMIT on your current branch (if COMMIT added 3 lines, the new commit will delete those 3 lines)
    • git reset --hard COMMIT: Force your current branch back to the state it was at COMMIT, erasing any new changes since COMMIT. Very dangerous operation.
    • git restore --source=COMMIT PATH: Take all the files in PATH back to how they were at COMMIT, without changing any other files or commit history.
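    All three can be seen side by side in a scratch repo (`-b main` needs git ≥ 2.28, git restore needs ≥ 2.23):

```shell
git init -q -b main rrr-demo && cd rrr-demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
echo one > f.txt && git add f.txt && git commit -q -m c1
echo two > f.txt && git commit -q -am c2

git revert --no-edit HEAD          # new commit undoing c2: file back to "one"
cat f.txt                          # → one
git rev-list --count HEAD          # → 3  (history grew)

git restore --source=HEAD^ f.txt   # only the file changes, no new commit
cat f.txt                          # → two
git rev-list --count HEAD          # → 3  (history unchanged)

git reset -q --hard HEAD~2         # branch forced back to c1; newer commits gone
git rev-list --count HEAD          # → 1
cat f.txt                          # → one
```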

    “untracked files”, “remote-tracking branch”, “track remote branch”

    Git uses the word “track” in 3 different related ways:

    • Untracked files: in the output of git status. This means those files aren’t managed by Git and won’t be included in commits.
    • a “remote tracking branch” like origin/main. This is a local reference, and it’s the commit ID that main pointed to on the remote origin the last time you ran git pull or git fetch.
    • “branch foo set up to track remote branch bar from origin”

    The “untracked files” and “remote tracking branch” thing is not too bad – they both use “track”, but the context is very different. No big deal. But I think the other two uses of “track” are actually quite confusing:

    • main is a branch that tracks a remote
    • origin/main is a remote-tracking branch

    But a “branch that tracks a remote” and a “remote-tracking branch” are different things in Git and the distinction is pretty important! Here’s a quick summary of the differences:

    • main is a branch. You can make commits to it, merge into it, etc. It’s often configured to “track” the remote main in .git/config, which means that you can use git pull and git push to push/pull changes.
    • origin/main is not a branch. It’s a “remote-tracking branch”, which is not a kind of branch (I’m sorry). You can’t make commits to it. The only way you can update it is by running git pull or git fetch to get the latest state of main from the remote.

    I’d never really thought about this ambiguity before but I think it’s pretty easy to see why folks are confused by it.

    checkout

    Checkout does two totally unrelated things:

    • git checkout BRANCH switches branches
    • git checkout file.txt discards your unstaged changes to file.txt

    This is well known to be confusing and git has actually split those two functions into git switch and git restore (though you can still use checkout if, like me, you have 15 years of muscle memory around git checkout that you don’t feel like unlearning)

    Also personally after 15 years I still can’t remember the order of the arguments to git checkout main file.txt for restoring the version of file.txt from the main branch.

    I think sometimes you need to pass -- to checkout as an argument somewhere to help it figure out which argument is a branch and which ones are paths but I never do that and I’m not sure when it’s needed.

    reflog

    Lots of people mentioned reading reflog as re-flog and not ref-log. I won’t get deep into the reflog here because this post is REALLY long but:

    • “reference” is an umbrella term git uses for branches, tags, and HEAD
    • the reference log (“reflog”) gives you the history of everything a reference has ever pointed to
    • It can help get you out of some VERY bad git situations, like if you accidentally delete an important branch
    • I find it one of the most confusing parts of git’s UI and I try to avoid needing to use it.

    merge vs rebase vs cherry-pick

    A bunch of people mentioned being confused about the difference between merge and rebase and not understanding what the “base” in rebase was supposed to be.

    I’ll try to summarize them very briefly here, but I don’t think these 1-line explanations are that useful because people structure their workflows around merge / rebase in pretty different ways and to really understand merge/rebase you need to understand the workflows. Also pictures really help. That could really be its whole own blog post though so I’m not going to get into it.

    • merge creates a single new commit that merges the 2 branches
    • rebase copies commits on the current branch to the target branch, one at a time.
    • cherry-pick is similar to rebase, but with a totally different syntax (one big difference is that rebase copies commits FROM the current branch, cherry-pick copies commits TO the current branch)

    rebase --onto

    git rebase has a flag called --onto. This has always seemed confusing to me because the whole point of git rebase main is to rebase the current branch onto main. So what’s the extra onto argument about?

    I looked it up, and --onto definitely solves a problem that I’ve rarely/never actually had, but I guess I’ll write down my understanding of it anyway.

    A - B - C (main)
         \
          D - E - F - G (mybranch)
              | 
              otherbranch
    

    Imagine that for some reason I just want to move commits F and G to be rebased on top of main. I think there’s probably some git workflow where this comes up a lot.

    Apparently you can run git rebase --onto main otherbranch mybranch to do that. It seems impossible to me to remember the syntax for this (there are 3 different branch names involved, which for me is too many), but I heard about it from a bunch of people so I guess it must be useful.
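    Here’s the diagram above recreated in a throwaway repo, so the three names have something to point at: --onto main says where the commits land, and otherbranch mybranch says which commits move (everything on mybranch after otherbranch).

    ```shell
    set -e
    cd "$(mktemp -d)"
    git init -q -b main
    git config user.email a@example.com
    git config user.name a
    c() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }
    c A; c B                       # A - B on main
    git checkout -q -b mybranch    # mybranch forks off at B
    git checkout -q main; c C      # main gains C
    git checkout -q mybranch
    c D; c E
    git branch otherbranch         # otherbranch points at E
    c F; c G                       # mybranch is now D - E - F - G

    # move only the commits after otherbranch (F and G) onto main
    git rebase -q --onto main otherbranch mybranch

    git log --format=%s            # G, F, C, B, A -- D and E stayed behind
    ```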

    commit

    Someone mentioned that they found it confusing that commit is used both as a verb and a noun in git.

    For example:

    • verb: “Remember to commit often”
    • noun: “the most recent commit on main”

    My guess is that most folks get used to this relatively quickly, but this use of “commit” is different from how it’s used in SQL databases, where I think “commit” is just a verb (you “COMMIT” to end a transaction) and not a noun.

    Also, you can think of a Git commit in 3 different ways:

    1. a snapshot of the current state of every file
    2. a diff from the parent commit
    3. a history of every previous commit

    None of those are wrong: different commands use commits in all of these ways. For example git show treats a commit as a diff, git log treats it as a history, and git restore treats it as a snapshot.
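    A throwaway-repo sketch of the same commit seen all three ways (assuming a git new enough for restore, >= 2.23):

    ```shell
    set -e
    cd "$(mktemp -d)"
    git init -q -b main
    git config user.email a@example.com
    git config user.name a
    echo one > file.txt && git add file.txt && git commit -q -m "first"
    echo two > file.txt && git add file.txt && git commit -q -m "second"

    git show HEAD                         # commit as a diff: -one / +two
    git log --oneline                     # commit as a point in a history
    git restore --source HEAD~1 file.txt  # commit as a snapshot to copy from
    cat file.txt                          # "one" again
    ```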

    But git’s terminology doesn’t do much to help you understand in which sense a commit is being used by a given command.

    more confusing terms

    Here are a bunch more confusing terms. I don’t know what a lot of these mean.

    things I don’t really understand myself:

    • “the git pickaxe” (maybe this is git log -S and git log -G, for searching the diffs of previous commits?)
    • submodules (all I know is that they don’t work the way I want them to work)
    • “cone mode” in git sparse checkout (no idea what this is but someone mentioned it)

    things that people mentioned finding confusing but that I left out of this post because it was already 3000 words:

    • blob, tree
    • the direction of “merge”
    • “origin”, “upstream”, “downstream”
    • that push and pull aren’t opposites
    • the relationship between fetch and pull (pull = fetch + merge)
    • git porcelain
    • subtrees
    • worktrees
    • the stash
    • “master” or “main” (it sounds like it has a special meaning inside git but it doesn’t)
    • when you need to use origin main (like git push origin main) vs origin/main

    github terms people mentioned being confused by:

    • “pull request” (vs “merge request” in gitlab which folks seemed to think was clearer)
    • what “squash and merge” and “rebase and merge” do (I’d never actually heard of git merge --squash until yesterday, I thought “squash and merge” was a special github feature)

    it’s genuinely “every git term”

    I was surprised that basically every other core feature of git was mentioned by at least one person as being confusing in some way. I’d be interested in hearing more examples of confusing git terms that I missed too.

    There’s another great post about this from 2012 called the most confusing git terminology. It talks more about how git’s terminology relates to CVS and Subversion’s terminology.

    If I had to pick the 3 most confusing git terms, I think right now I’d pick:

    • a head is a branch, HEAD is the current branch
    • “remote tracking branch” and “branch that tracks a remote” being different things
    • how “index”, “staged”, “cached” all refer to the same thing

    that’s all!

    I learned a lot from writing this – I learned a few new facts about git, but more importantly I feel like I have a slightly better sense now for what someone might mean when they say that everything in git is confusing.

    I really hadn’t thought about a lot of these issues before – like I’d never realized how “tracking” is used in such a weird way when discussing branches.

    Also as usual I might have made some mistakes, especially since I ended up in a bunch of corners of git that I hadn’t visited before.

    Also a very quick plug: I’m working on writing a zine about git, if you’re interested in getting an email when it comes out you can sign up to my very infrequent announcements mailing list.

Random expired ±

2 links selected from 520 expired links

Expired ±

520 links expired today Tue, Apr. 9, 2024
  1. Manhattan Phoenix review: epic history of how New York was forged by fire – and water | Books | The Guardian

    Daniel Levy pins the great fire of 1835 as the birth event of modern Manhattan in a tale as teeming as the city itself (Open link)

  2. here

    Sample code and instructions for steps through different container image build options. - GitHub - maeddes/options-galore-container-build: Sample code and instructions for steps through different container image build op (Open link)

  3. The Go Memory Model

    Table of ContentsIntroductionAdviceInformal OverviewMemory ModelImplementation Restrictions for Programs Containing Data RacesSynchronizationInitializationGoroutine creationGoroutine destructionChannel communicationLocks (Open link)

  4. You and your mind garden

    In French, “cultiver son jardin intérieur” means to tend to your internal garden—to take care of your mind. The garden metaphor is particularly apt: taking care of your mind involves cultivating your curiosity (the seeds (Open link)

  5. Not your grandfather’s logs — A Java library’s new approach to observability | by Roni Dover | Apr, 2023 | Better Programming

    A while back, I wrote about the fact that logs need an overhaul, and that practices that were relevant when logs were still text messages in files may no longer be relevant in an age when logs traces… (Open link)

  6. The Kaizen Way

    Are you looking for a new approach to health? Do you want to finally get the results you have been hoping for? How do you find a practitioner that is willing to try a different approach and guide you through your journey (Open link)

  7. Owen Pomery

    OWEN D. POMERY WORK SHOP ABOUT + CONTACT EDITORIAL RELIEFS EDITION AIRBNB FLAT EYE ERNEST POHA HOUSE CONCEPT SENET VICTORY POINT KIOSK SCI-FI SPOT ILLUSTRATIONS GAME OF THRONES (GAME) NARRATIVE ARCHIT (Open link)

  8. Effective Go

    Table of ContentsIntroductionExamplesFormattingCommentaryNamesPackage namesGettersInterface namesMixedCapsSemicolonsControl structuresIfRedeclaration and reassignmentForSwitchType switchFunctionsMultiple return valuesNam (Open link)

  9. https://rust-book.cs.brown.edu/

    Welcome to the Rust Book experiment, and thank you for your participation! First, we want to introduce you to the new mechanics of this experiment. The main mechanic is quizzes: each page has a few quizzes about the pag (Open link)

  10. Gonzo journalism

    The "Gonzo fist", characterized by two thumbs and four fingers holding a peyote button, was originally used in Hunter S. Thompson's 1970 campaign for sheriff of Pitkin County, Colorado. It has since evolved into a symbol (Open link)

  11. Top 10 Architecture Characteristics / Non-Functional Requirements with Cheatsheet | by Love Sharma | Jun, 2022 | Dev Genius

    Imagine you are buying a car. What essential features do you need in it? A vehicle should deliver a person from point A to point B. But what we also check in it is Safety, Comfort, Maintainability… (Open link)

  12. Diátaxis

    The Diátaxis framework solves the problem of structure in technical documentation, making it easier to create, maintain and use. (Open link)

  13. User space

    For the term "user space" as used in Wikipedia, see Wikipedia:User pages. "Kernel space" redirects here. For the mathematical definition, see Null space. This article needs additional citations for verification. Please h (Open link)

  14. The Book

    This is the story of Simon Wardley. Follow his journey from bumbling and confused CEO lost in the headlights of change to someone with a vague idea of what they're doing. (Open link)

  15. Let's talk SkipList

    BackgroundSkipLists often come up when discussing “obscure” data-structures but in reality they are not that obscure, in fact many of the production grade softwares actively use them. In this post I’ll try to go into Ski (Open link)

  16. https://programmerweekly.us2.list-manage.com/track/click?u=72f68dcee17c92724bc7822fb&id=c6a9958764&e=d7c3968f32

    Ever since I started to work on the Apache APISIX project, I’ve been trying to improve my knowledge and understanding of REST RESTful HTTP APIs. For this, I’m reading and watching the following sources: Books. At the mom (Open link)

  17. Overheard at QCon 2022: Navigate complex environment and evolving relationships | by McKinsey Digital | McKinsey Digital Insights | Jan, 2023 | Medium

    We recently had the pleasure of participating in QCon, a global conference that gathers the best engineers from top-notch innovation companies. The event covers a wide range of relevant software… (Open link)

  18. What's in a Good Error Message?

    In a way, an error message tells a story; and as with every good story, you need to establish some context about its general settings. For an error message, this should tell the recipient what the code in question was tr (Open link)

  19. A Better Way to Manage Projects

    The GOVNO framework is a novel approach to project management that aims to improve upon the shortcomings of the popular scrum methodology. Each letter of the acronym represents a key aspect of the framework: G: Governan (Open link)

  20. The Grug Brained Developer

    Introduction this collection of thoughts on software development gathered by grug brain developer grug brain developer not so smart, but grug brain developer program many long year and learn some things although mostly s (Open link)

  21. In Praise Of Anchovies: If You Don’t Already Love Them, You Just Haven’t Yet Discovered How Good They Can Be

    For many people, anchovies are one of those foods to be avoided like the plague. But for Ken Gargett anchovies are not a love-it-or-hate it food. Rather, they are a love-it-or-you-have-not-discovered-how-good-they-can-be (Open link)

  22. Maintaining a medium-sized Java library in 2022 and beyond

    scc --exclude-dir docs/book ─────────────────────────────────────────────────────────────────────────────── Language Files Lines Blanks Comments Code Complexity ──────────────────────────────── (Open link)

  23. What's in a Good Error Message? - Gunnar Morling

    In a way, an error message tells a story; and as with every good story, you need to establish some context about its general settings. For an error message, this should tell the recipient what the code in question was tr (Open link)

  24. Beyond Microservices: Streams, State and Scalability

    Gwen Shapira talks about how microservices evolved in the last few years, based on experience gained while working with companies using Apache Kafka to update their application architecture. (Open link)

  25. Shields Down

    Resignations happen in a moment, and it’s not when you declare, “I’m resigning.” The moment happened a long time ago when you received a random email from a good friend who asked, “I know you’re really happy with your cu (Open link)

  26. Convey

    2022 February 01 16:21 stuartscott 1473754¤ 1240149¤ You may have noticed that the January edition of the Convey Digest looks a little different from the previous ones - the color scheme is now based on the dominant (Open link)

  27. 7 days ago

    This article has a large gap in the story: it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude. They have become ubiquitous in diverse, mediu (Open link)

  28. Optimizing Distributed Joins: The Case of Google Cloud Spanner and DataStax Astra DB | by DataStax | Building Real-World, Real-Time AI | Medium

    In this post, learn how relational and NoSQL databases, Google Cloud Spanner and DataStax Astra DB, optimize distributed joins for real-time applications. Distributed joins are commonly considered to… (Open link)

  29. The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure | by Zhenzhong Xu | Feb, 2022 | Medium

    The blog post will share the four phases of Real-time Data Infrastructure’s iterative journey in Netflix (2015-2021). For each phase, we will go over the evolving business motivations, the team’s unique challenges, the (Open link)

  30. Bicycle

    There is something delightful about riding a bicycle. Once mastered, the simple action of pedaling to move forward and turning the handlebars to steer makes bike riding an effortless activity. In the demonstration below, (Open link)